This exploratory data analysis uses the red wine dataset, which contains 1599 observations of 13 variables. A histogram and a boxplot are drawn for each variable to show how its values are distributed. All of the variables are numeric except quality, which is stored as an integer score and treated as a categorical variable that rates the wine. I would like to determine which variables affect the quality of wine and how the variables correlate with one another.
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
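The listings above have the shape of the output of base R's `names()`, `str()` and `summary()`. As a minimal sketch, the same calls are shown here on a small hand-built data frame. The name `wine` and the choice of columns are assumptions for illustration, and the values are the first six rows printed in this report rather than the full file.

```r
# Toy data frame: first six rows of a few columns, copied from this report.
# `wine` is a hypothetical name; the real data would be read from the dataset file.
wine <- data.frame(
  X             = 1:6,
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality       = c(5L, 5L, 5L, 6L, 5L, 5L)
)

names(wine)    # column names
str(wine)      # type and first values of each column
summary(wine)  # minimum, quartiles, mean and maximum of each column

# X is only a row index, so it is dropped before any analysis.
wine_no_index <- wine[, names(wine) != "X"]
print(names(wine_no_index))
```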
Quality is treated as the most important feature in this analysis. Its distribution is plotted as follows.
Most of the wines are of average quality, i.e. they score 5 or 6 (1319 of the 1599 wines).
Observations of variables:
1. Fixed.acidity has a median of 7.9 and outliers in the high range.
2. Volatile.acidity has a long right tail that extends up to 1.58, with a median of 0.52.
3. The citric.acid plot looks different from the others: the median is 0.26 and very few wines have citric acid above 0.80 (the maximum is 1.00).
4. Residual.sugar and chlorides look similar: both have a long right tail and many outliers.
5. Free.sulfur.dioxide, total.sulfur.dioxide and sulphates have long right tails with outliers in the high range.
6. Density and pH are approximately normally distributed. Most of the wines have a pH in the range [3, 3.5].
7. Alcohol has fewer outliers than sulphates, residual.sugar, volatile.acidity, chlorides and free/total.sulfur.dioxide. It has a positively skewed distribution.
8. Most quality values are 5, 6 or 7, within a range of 3 to 8 and with a median of 6.
9. Every variable in the dataset has outliers.
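The outlier and skew observations above can be checked numerically as well as visually. The sketch below assumes the usual boxplot rule (points more than 1.5 times the interquartile range beyond the hinges) and uses only the ten residual.sugar values printed in the `str()` listing, so it illustrates the method rather than the full dataset. The variable names are chosen for illustration.

```r
# First ten residual.sugar values, copied from the str() listing in this report.
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)

# boxplot.stats() applies the same 1.5 * IQR rule that boxplot() uses for whiskers.
box_stats <- boxplot.stats(residual_sugar, coef = 1.5)
outliers  <- box_stats$out

# A mean above the median is a simple sign of a long right tail.
right_skewed <- mean(residual_sugar) > median(residual_sugar)

print(outliers)
print(right_skewed)
```

On the full data the same two checks could be applied to each column in turn, for example with `sapply()`.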
The dataset is tidy and contains 1599 observations of 13 variables. The X variable is only a row index and gives no information about the wine, so it is ignored. Of the remaining 12 variables, 11 are numeric and one, quality, is categorical. I would like to analyze how these variables affect the quality of wine.
Quality is the main feature of interest in the dataset.
Alcohol, pH and residual.sugar may affect the quality of wine.
## X fixed.acidity volatile.acidity citric.acid
## X 1.00 -0.27 -0.01 -0.15
## fixed.acidity -0.27 1.00 -0.26 0.67
## volatile.acidity -0.01 -0.26 1.00 -0.55
## citric.acid -0.15 0.67 -0.55 1.00
## residual.sugar -0.03 0.11 0.00 0.14
## chlorides -0.12 0.09 0.06 0.20
## free.sulfur.dioxide 0.09 -0.15 -0.01 -0.06
## total.sulfur.dioxide -0.12 -0.11 0.08 0.04
## density -0.37 0.67 0.02 0.36
## pH 0.14 -0.68 0.23 -0.54
## sulphates -0.13 0.18 -0.26 0.31
## alcohol 0.25 -0.06 -0.20 0.11
## quality 0.07 0.12 -0.39 0.23
## residual.sugar chlorides free.sulfur.dioxide
## X -0.03 -0.12 0.09
## fixed.acidity 0.11 0.09 -0.15
## volatile.acidity 0.00 0.06 -0.01
## citric.acid 0.14 0.20 -0.06
## residual.sugar 1.00 0.06 0.19
## chlorides 0.06 1.00 0.01
## free.sulfur.dioxide 0.19 0.01 1.00
## total.sulfur.dioxide 0.20 0.05 0.67
## density 0.36 0.20 -0.02
## pH -0.09 -0.27 0.07
## sulphates 0.01 0.37 0.05
## alcohol 0.04 -0.22 -0.07
## quality 0.01 -0.13 -0.05
## total.sulfur.dioxide density pH sulphates alcohol
## X -0.12 -0.37 0.14 -0.13 0.25
## fixed.acidity -0.11 0.67 -0.68 0.18 -0.06
## volatile.acidity 0.08 0.02 0.23 -0.26 -0.20
## citric.acid 0.04 0.36 -0.54 0.31 0.11
## residual.sugar 0.20 0.36 -0.09 0.01 0.04
## chlorides 0.05 0.20 -0.27 0.37 -0.22
## free.sulfur.dioxide 0.67 -0.02 0.07 0.05 -0.07
## total.sulfur.dioxide 1.00 0.07 -0.07 0.04 -0.21
## density 0.07 1.00 -0.34 0.15 -0.50
## pH -0.07 -0.34 1.00 -0.20 0.21
## sulphates 0.04 0.15 -0.20 1.00 0.09
## alcohol -0.21 -0.50 0.21 0.09 1.00
## quality -0.19 -0.17 -0.06 0.25 0.48
## quality
## X 0.07
## fixed.acidity 0.12
## volatile.acidity -0.39
## citric.acid 0.23
## residual.sugar 0.01
## chlorides -0.13
## free.sulfur.dioxide -0.05
## total.sulfur.dioxide -0.19
## density -0.17
## pH -0.06
## sulphates 0.25
## alcohol 0.48
## quality 1.00
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar"
## [1] "chlorides" "free.sulfur.dioxide" "total.sulfur.dioxide"
## [4] "density"
## [1] "pH" "sulphates" "alcohol" "quality"
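A correlation table like the one above can be produced with `cor()` and `round()`. This is a sketch on ten rows copied from the `str()` listing, so the coefficients it prints will not match the full-data values in the report. The data frame name `wine_sample` is an assumption.

```r
# Ten rows of three columns, copied from the str() listing in this report.
wine_sample <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pairwise Pearson correlations, rounded to two decimals as in the report.
cor_matrix <- round(cor(wine_sample, method = "pearson"), 2)
print(cor_matrix)

# Correlation of every variable with quality, strongest positive first.
quality_cor <- sort(cor_matrix[, "quality"], decreasing = TRUE)
print(quality_cor)
```

Sorting the quality column makes it easier to pick the candidates for the bivariate plots than scanning the full matrix.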
A new variable, ratingbucket, divides quality into three bins: (2, 4] as bad, (4, 6] as average and (6, 8] as good. This variable is used in the multivariate analysis.
##
## (2,4] (4,6] (6,8]
## 63 1319 217
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.4 0.70 0.00 1.9 0.076
## 2 2 7.8 0.88 0.00 2.6 0.098
## 3 3 7.8 0.76 0.04 2.3 0.092
## 4 4 11.2 0.28 0.56 1.9 0.075
## 5 5 7.4 0.70 0.00 1.9 0.076
## 6 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality ratingbucket
## 1 5 (4,6]
## 2 5 (4,6]
## 3 5 (4,6]
## 4 6 (4,6]
## 5 5 (4,6]
## 6 5 (4,6]
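The interval labels in the ratingbucket column match the default output of base R's `cut()`. The sketch below assumes that function was used and runs it on a made-up quality vector; the `rating` variable with word labels is a hypothetical addition, not something shown in the report.

```r
# Made-up quality scores covering the observed range of 3 to 8.
quality <- c(3L, 4L, 5L, 5L, 6L, 6L, 7L, 8L)

# Right-closed intervals (the default) give the labels (2,4], (4,6] and (6,8].
ratingbucket <- cut(quality, breaks = c(2, 4, 6, 8))
print(table(ratingbucket))

# Optional: the same bins with readable labels (hypothetical variable).
rating <- cut(quality, breaks = c(2, 4, 6, 8),
              labels = c("bad", "average", "good"))
print(table(rating))
```

Because the intervals are closed on the right, a score of 4 falls in the bad bin and a score of 6 in the average bin.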
Based on the correlation results, the plots for the positively correlated variables are as follows.
As quality increases, the mean (blue point) and median of fixed.acidity fluctuate. Although the correlation is positive (0.12), the plot shows no clear effect of fixed.acidity on wine quality.
The correlation between quality and residual.sugar is very low (0.013). Residual.sugar does not increase noticeably as quality increases.
Sulphates increase monotonically with wine quality.
Of all the properties, alcohol has the highest correlation with wine quality (0.48). Alcohol increases monotonically with quality.
The plots for the negatively correlated variables are as follows.
Volatile.acidity and chlorides decrease gradually as wine quality increases. Free/total.sulfur.dioxide, density and pH fluctuate at quality 5 and 6 but decrease overall as quality increases.
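The mean and median per quality level discussed above can be computed directly with `tapply()`. The sketch uses the ten alcohol and quality values from the `str()` listing, and `is_monotonic` is a hypothetical helper name, so the result describes this small sample only and not the trends reported for the full data.

```r
# Ten alcohol and quality values, copied from the str() listing in this report.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Mean and median of alcohol within each quality level.
mean_by_quality   <- tapply(alcohol, quality, mean)
median_by_quality <- tapply(alcohol, quality, median)
print(mean_by_quality)
print(median_by_quality)

# TRUE only if the group means rise at every step from one quality level to the next.
is_monotonic <- all(diff(mean_by_quality) > 0)
print(is_monotonic)
```

Replacing `> 0` with `< 0` gives the corresponding check for the negatively correlated variables.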
Based on the correlation results, fixed.acidity, citric.acid, residual.sugar, sulphates and alcohol are positively correlated with quality, while volatile.acidity, chlorides, free/total.sulfur.dioxide, density and pH are negatively correlated with it.
Citric.acid and density are both well correlated with fixed.acidity (0.67 each). Free.sulfur.dioxide and total.sulfur.dioxide are also well correlated (0.67). Sulphates gradually increase with the quality of wine. In general, a pH below 7 is acidic. The surprising result was that pH decreases slightly as wine quality increases (correlation -0.06).
Based on research, alcohol is one of the important components in red wine that cause health issues. This analysis shows that alcohol has the strongest correlation with quality (0.48).
Based on the bivariate plots, alcohol has the highest positive correlation with quality of all the variables. The multivariate analysis checks how the other variables interact with alcohol and quality.
As alcohol increases, the quality of wine decreases with decreasing fixed.acidity.
There is no significant increase in the quality of wine as alcohol content increases.
Residual.sugar has many outliers. For lower values of residual.sugar, the quality of wine decreases.
The quality of wine is good at higher levels of alcohol and sulphates.
Better-quality wines are more acidic as alcohol content increases.
Both plots show a good correlation, and quality improves as the x-axis and y-axis values increase.
Better wines are found at higher ranges of total.sulfur.dioxide and free.sulfur.dioxide, although there are a few outliers.
Wines with higher fixed.acidity and density tend to be of good quality; quality increases monotonically with fixed.acidity and density. Alcohol and pH do affect the quality of wine, but residual.sugar does not correlate well with quality.
Citric.acid with fixed.acidity, as well as density with fixed.acidity, show better-quality wines at higher ranges. The quality of wine is also good at higher ranges of alcohol and sulphate content.
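The multivariate observations above compare pairs of variables across quality groups. One way to back such a plot with numbers is to average both variables per rating bucket with `aggregate()`. This is a sketch on made-up values (none of them come from the dataset), and the names `wine_toy` and `bucket_means` are assumptions.

```r
# Made-up wines: values are invented for illustration only.
wine_toy <- data.frame(
  quality   = c(3, 4, 5, 5, 6, 6, 7, 8),
  alcohol   = c(9.0, 9.5, 9.4, 9.8, 10.5, 10.9, 11.8, 12.5),
  sulphates = c(0.50, 0.55, 0.56, 0.60, 0.65, 0.68, 0.75, 0.80)
)

# Same bins as the ratingbucket variable described earlier in the report.
wine_toy$ratingbucket <- cut(wine_toy$quality, breaks = c(2, 4, 6, 8))

# Mean alcohol and mean sulphates within each bucket.
bucket_means <- aggregate(cbind(alcohol, sulphates) ~ ratingbucket,
                          data = wine_toy, FUN = mean)
print(bucket_means)
```

If both means rise from the (2,4] bucket to the (6,8] bucket on the real data, that would support the reading that better wines sit at higher alcohol and sulphate levels.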
Of all the variables in the dataset, I chose quality as the main feature in order to analyze how the other chemical properties affect it.
In general, a high intake of alcohol leads to health issues, so I assumed that the wines would have a low alcohol content. The analysis shows, however, that wine quality increases with alcohol content. I chose this plot to show the relationship between these two variables.
The analysis of the red wine dataset shows that, when the pH is acidic, wine quality increases with alcohol content, which proved my assumption wrong.
When I first looked at the red wine dataset, I was not sure which properties would affect the quality of wine. After exploring how each variable is distributed, I decided to treat quality as the main feature and to determine which variables affect it. I thought alcohol, pH and residual.sugar might help to determine the quality of wine. Residual.sugar turned out not to be useful in the analysis. The surprising result was that more acidic wines have better quality. Alcohol content played an important role in explaining the quality of red wine.
For future analysis, I would like a dataset that covers different wine styles, for example fruit composition, rich and dark wines, long-aging types and the techniques used in wine making. This information would help to explore the data and determine the quality of each wine style.
[https://www.hsph.harvard.edu › … › Drinks to Consume in Moderation]
[http://www.sthda.com › … › R software › Data Visualization › ggplot2 - Essentials]
[https://campus.datacamp.com/courses/introduction-to-r-for-finance/factors-4?ex=8]
[https://onlinecourses.science.psu.edu/stat857/node/223]
[http://www.sthda.com/english/wiki/ggplot2-colors-how-to-change-colors-automatically-and-manually]